Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach

Authors

  • Soha Ahmed
  • Mengjie Zhang
  • Lifeng Peng
Abstract

Biomarker discovery using mass spectrometry (MS) data is very useful in disease detection and drug discovery. Biomarker discovery in MS data must start with feature selection, because the number of features in MS data is extremely large (e.g. thousands) while the number of samples is comparatively small. In this study, we propose the use of genetic programming (GP) for automatic feature selection and classification of MS data. This GP-based approach uses, as its terminal set, the features selected by two feature selection metrics, namely information gain (IG) and relief-f (REFS-F). The feature selection performance of the proposed approach is examined and compared with IG and REFS-F alone on five MS data sets with different numbers of features and instances. Naive Bayes (NB), support vector machines (SVMs) and J48 decision trees (J48) are used in the experiments to evaluate the classification accuracy of the selected features. GP is also used as a classification method in the experiments, and its performance is compared with that of NB, SVMs and J48. The results show that GP as a feature selection method can select a smaller number of features with better classification performance than IG and REFS-F using NB, SVMs and J48. In addition, GP as a classification method outperforms NB and J48 and achieves comparable or slightly better performance than SVMs on these data sets.
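To make the filtering step concrete, the following is a minimal Python sketch of ranking features by information gain so that the top-ranked ones could seed a GP terminal set. It is not the authors' implementation: the function names (`entropy`, `information_gain`, `rank_features`), the median-threshold discretization and the toy data are all assumptions made for illustration, and the relief-f ranking and the GP run itself are not shown.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Information gain of splitting one continuous feature at `threshold`.

    Assumption: a single binary split stands in for whatever
    discretization a real IG filter would apply.
    """
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing, so nothing is gained
    total = len(labels)
    remainder = (len(left) / total) * entropy(left) \
        + (len(right) / total) * entropy(right)
    return entropy(labels) - remainder


def rank_features(rows, labels, top_k):
    """Return the indices of the top_k features by information gain.

    Each feature is split at its median (an illustrative choice).
    Ties are broken by the lower feature index.
    """
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((information_gain(column, labels, median(column)), j))
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [j for _, j in ranked[:top_k]]


# Toy data (hypothetical): 4 samples, 3 intensity features.
rows = [
    [0.1, 5.0, 2.0],
    [0.2, 1.0, 2.1],
    [0.9, 4.8, 1.9],
    [0.8, 1.2, 2.2],
]
labels = ["healthy", "healthy", "disease", "disease"]

# Indices of the features that would be offered to GP as terminals.
terminal_set = rank_features(rows, labels, top_k=1)
```

In this toy data only the first feature separates the two classes at its median, so it is the one retained. In the setting the abstract describes, the features kept by IG and by relief-f would both be made available to GP, which then selects among them while evolving a classifier.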


Similar resources

Feature Selection for Small Sample Sets with High Dimensional Data Using Heuristic Hybrid Approach

Feature selection can be decisive when analyzing high dimensional data, especially with a small number of samples. Feature extraction methods do not perform well under these conditions. With small sample sets and high dimensional data, exploring a large search space and learning from insufficient samples becomes extremely hard. As a result, neural networks and clustering a...


A Multi-objective Genetic Programming Biomarker Detection Approach in Mass Spectrometry Data

Mass spectrometry is currently the most commonly used technology in biochemical research for proteomic analysis. The main goal of proteomic profiling using mass spectrometry is the classification of samples from different clinical states. This requires the identification of proteins or peptides (biomarkers) that are expressed differentially between different clinical states. However, due to the...


A New Hybrid Feature Subset Selection Algorithm for the Analysis of Ovarian Cancer Data Using Laser Mass Spectrum

Introduction: A major problem in the treatment of cancer is the lack of an appropriate method for the early diagnosis of the disease. The chemical reaction within an organ may be reflected in the form of proteomic patterns in the serum, sputum, or urine. Laser mass spectrometry is a valuable tool for extracting the proteomic patterns from biological samples. A major challenge in extracting such ...


Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine

DNA microarray gene expression data give access to a wealth of information with thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and the differences between tumors. In this study we try to reduce high-dimensional data with a statistical method to select valuable, high-impact genes as biomarkers and then classify ovarian tumors based on gene expression data of...


Determining Optimal Support Vector Machines for Hyperspectral Image Classification Based on a Genetic Algorithm

Hyperspectral remote sensing imagery, owing to its rich spectral information, provides an efficient tool for ground classification in complex geographical areas with similar classes. Because Support Vector Machines (SVMs) are robust in high dimensional spaces, they are an efficient tool for classifying hyperspectral imagery. However, there are two optimization issues which s...




Publication date: 2013